Introducing Linggle: From Concordance to Linguistic Search Engine
Abstract
We introduce a Web-scale linguistics search engine, Linggle, that retrieves lexical bundles in response to a given query. Unlike a typical concordance, Linggle accepts queries with keywords, wildcards, wild parts of speech (PoS), synonymous words, and additional regular expression (RE) operators, and returns bundles with frequency counts. In our approach, we augment the Google Web 1T corpus with inverted file indexing, PoS information from the BNC, and semantic indexing based on Latent Dirichlet Allocation. The method involves parsing the query and transforming it into several keyword retrieval commands, retrieving word chunks with counts, filtering the chunks against the query as an RE, and finally displaying the results according to count, similarity, and topic. Clusters of synonymous or conceptually related words are also provided. In addition, Linggle provides example sentences from The New York Times on demand. The current implementation of Linggle is functionally the most comprehensive, and is in principle language and dataset independent. We plan to extend Linggle to provide fast and convenient access to the wealth of linguistic information embodied in Web-scale datasets, including Google Web 1T and Google Books Ngram, for many major languages of the world.
For non-native speakers, several problems arise frequently when writing in English: doubts about the usage of a preposition, whether a determiner is mandatory, whether a verb can correctly take a given object, and the need for synonyms of a term in a given context. Printed collocation dictionaries and reference tools based on compiled corpora offer limited coverage of word usage, even though knowledge of collocations is vital for the competent use of a language. We propose to address these limitations with a comprehensive system that truly aims at letting learners “know a word by the company it keeps”.
Linggle (linggle.com) is a broad-coverage language reference tool for learners of English as a Second Language (ESL). The system is designed to access words in context under various forms. First, we build an inverted file index for the Google Web 1T Ngram to support queries with RE-like patterns, including PoS and synonym matches. For example, for the query “$V $D +important role”, Linggle retrieves 4-gram chunks that start with a ...
PACLIC-27
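The retrieval steps described above (index lookup by keyword, filtering candidates against the full pattern, ranking by count) can be illustrated with a minimal sketch. It is not Linggle's implementation: the n-gram counts, the `POS` and `SYNONYMS` tables, and the function names are all hypothetical, and the operator meanings (`$V` for any verb, `$D` for any determiner, `+word` for the word or one of its synonyms) are assumed from the example query.

```python
from collections import defaultdict

# Toy data with made-up counts (NOT taken from Google Web 1T).
NGRAMS = {
    ("play", "an", "important", "role"): 120,
    ("plays", "a", "significant", "role"): 80,
    ("is", "an", "important", "role"): 40,
    ("play", "the", "important", "role"): 15,
    ("very", "an", "important", "role"): 2,
    ("play", "an", "important", "part"): 30,
}
# Hypothetical PoS lexicon and synonym table (the paper uses BNC and LDA).
POS = {"play": {"V"}, "plays": {"V"}, "is": {"V"},
       "a": {"D"}, "an": {"D"}, "the": {"D"}}
SYNONYMS = {"important": {"significant", "crucial"}}


def build_index(ngrams):
    """Inverted index: (position, word) -> set of n-grams with word there."""
    index = defaultdict(set)
    for gram in ngrams:
        for pos, word in enumerate(gram):
            index[(pos, word)].add(gram)
    return index


def token_matches(token, word):
    """Check one query token against one word of a candidate n-gram."""
    if token.startswith("$"):   # assumed: wild part of speech
        return token[1:] in POS.get(word, set())
    if token.startswith("+"):   # assumed: the word itself or a synonym
        base = token[1:]
        return word == base or word in SYNONYMS.get(base, set())
    return token == word        # plain keyword


def search(query, ngrams, index):
    """Retrieve by literal keywords, filter by full pattern, rank by count."""
    tokens = query.split()
    candidates = None
    for pos, tok in enumerate(tokens):
        if tok[0] not in "$+":  # only literal keywords use the index
            postings = {g for g in index.get((pos, tok), set())
                        if len(g) == len(tokens)}
            candidates = postings if candidates is None else candidates & postings
    if candidates is None:      # no literal keyword: fall back to a full scan
        candidates = {g for g in ngrams if len(g) == len(tokens)}
    hits = [g for g in candidates
            if all(token_matches(t, w) for t, w in zip(tokens, g))]
    return sorted(((" ".join(g), ngrams[g]) for g in hits),
                  key=lambda item: -item[1])


INDEX = build_index(NGRAMS)
for chunk, count in search("$V $D +important role", NGRAMS, INDEX):
    print(f"{count:>5}  {chunk}")
```

In this sketch only the literal keyword `role` narrows the candidate set through the index; the PoS and synonym tokens are checked afterwards, mirroring the retrieve-then-filter order the abstract describes.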
Similar resources
Linggle: a Web-scale Linguistic Search Engine for Words in Context
In this paper, we introduce a Web-scale linguistics search engine, Linggle, that retrieves lexical bundles in response to a given query. The query might contain keywords, wildcards, wild parts of speech (PoS), synonyms, and additional regular expression (RE) operators. In our approach, we incorporate inverted file indexing, PoS information from BNC, and semantic indexing based on Latent Dirichl...
Linggle Knows: A Search Engine Tells How People Write
This paper presents Linggle Knows, an English grammar and linguistic search engine. Linggle Knows helps people write by displaying lexical and grammatical information extracted from a couple of large-scale corpora, including Google Web 1T 5-gram, British National Corpus (BNC), New York Times Annotated Corpus (NYT), etc. It not only describes how a word is genuinely used, but also recommends va...
Analyzing the Sense Distribution of Concordances Obtained by Web as Corpus Approach
In corpus-based lexicography and natural language processing fields some authors have proposed using the Internet as a source of corpora for obtaining concordances of words. Most techniques implemented with this method are based on information retrieval-oriented web searchers. However, rankings of concordances obtained by these search engines are not built according to linguistic criteria but t...
Concordance Of Snippet
Excellent concordances can be produced by tools mounted on regular web search engines but these tools are not suitable for quick lookups on the web because it takes time to collect ad-hoc corpora with occurrences of a queried word or phrase. It is possible to get a web concordance in an instant if the amount of transferred data can be limited. One way to do it is to use snippets from a search e...
GETESS: Constructing a Linguistic Search Index for an Internet Search Engine
In this paper we illustrate how Internet documents can be automatically analyzed in order to capture the content of a document in a more detailed way than usual. The result of the document analysis is called an abstract and will be used as a linguistic search index for the Internet search engine GETESS. We show how the linguistic analysis system SMES can be used for a Harvest-based search engine f...